Memory Hierarchy Considerations for Fast Transpose and Bit-Reversals
Authors
Abstract
This paper explores the interplay between algorithm design and a computer's memory hierarchy. Matrix transpose and the bit-reversal reordering are important scientific subroutines that often exhibit severe performance degradation due to cache and TLB associativity problems. We give lower bounds showing that, for typical memory hierarchy designs, extra data movement is unavoidable. We also prescribe the characteristics that various levels of the memory hierarchy need in order to perform efficient bit-reversals. Insight gained from our analysis leads to the design of a near-optimal bit-reversal algorithm. This Cache Optimal Bit Reverse Algorithm (COBRA) is implemented on the Digital Alpha 21164, Sun Ultrasparc 2, and IBM Power2. We show that COBRA is near optimal with respect to execution time on these machines and performs much better than the previous best known algorithms. Copyright 1998 IEEE. Published in the Proceedings of HPCA 5, 9-13 January 1999 in Orlando, FL.
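For readers unfamiliar with the bit-reversal reordering named in the abstract, the following is a minimal sketch of the naive out-of-place version. It is not COBRA and is not taken from the paper; the function names `bit_reverse` and `bit_reversal_permute` are illustrative assumptions. Element i of the input is written to the index obtained by reversing the binary digits of i, so consecutive reads scatter into writes separated by large power-of-two strides, which is the kind of access pattern the abstract associates with cache and TLB associativity problems.

```python
def bit_reverse(i: int, bits: int) -> int:
    """Reverse the low `bits` bits of i (hypothetical helper name)."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)  # shift the lowest bit of i into r
        i >>= 1
    return r


def bit_reversal_permute(x: list) -> list:
    """Naive out-of-place bit-reversal reordering; NOT the COBRA algorithm."""
    n = len(x)
    bits = n.bit_length() - 1
    if n == 0 or (1 << bits) != n:
        raise ValueError("length must be a power of two")
    y = [None] * n
    for i in range(n):
        # Sequential reads from x, scattered writes into y.
        y[bit_reverse(i, bits)] = x[i]
    return y
```

For an 8-element input, index 1 (binary 001) maps to index 4 (binary 100), index 3 (011) maps to index 6 (110), and so on; applying the permutation twice returns the original order.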
Similar resources
An Efficient in-Place 3D Transpose for Multicore Processors with Software Managed Memory Hierarchy
3D transpose is an important operation in many large scale scientific applications such as seismic and medical imaging. This paper proposes a novel algorithm for fast in-place 3D transpose operation. The algorithm exploits SIMD multicore architecture with software managed memory hierarchy. Such architectural features are present in the next generation processors, such as the Cell BE processor. ...
Area-Speed-Efficient Transpose-Memory Architecture for Signal-Processing Systems
This paper presents the design and analysis of a high-speed implementation of a new transpose memory architecture. The proposed memory structure achieves almost 4X improvement in speed while consuming 46% less area, compared to prior work. For example, an 8X8 transpose memory with 12-bit input/output resolution has been implemented in 140 slices on a Virtex-7 Xilinx FPGA platform, achieving 107...
Fast Bit-Reversals on Uniprocessors and Shared-Memory Multiprocessors
In this paper, we examine different methods using techniques of blocking, buffering, and padding for efficient implementations of bit-reversals. We evaluate the merits and limits of each technique and its application and architecture-dependent conditions for developing cache-optimal methods. Besides testing the methods on different uniprocessors, we conducted both simulation and measurements on...
Towards an Optimal Bit-Reversal Permutation Program
The speed of many computations is limited not by the number of arithmetic operations but by the time it takes to move and rearrange data in the increasingly complicated memory hierarchies of modern computers. Array transpose and the bit-reversal permutation – trivial operations on a RAM – present non-trivial problems when designing highly-tuned scientific library functions, particularly for the F...
A Portable 3D FFT Package for Distributed-Memory Parallel Architectures
1 Introduction. Multidimensional FFTs are used frequently in engineering and scientific calculations, especially in image processing. Parallel implementations of FFT generally follow two approaches. One is the binary-exchange approach [1, 2], where data exchanges take place in all pairs of processors with processor numbers differing by one bit. Another one is the transpose approach[...